---
title: "NovaMart Global"
subtitle: "Business Intelligence Report: Revenue, Retention & Optimisation"
author: "Freda Erinmwingbovo"
date: today
format:
html:
theme: [darkly, novamart.scss]
toc: true
toc-depth: 3
toc-title: "Report Contents"
toc-location: left
number-sections: false
code-fold: true
code-tools: true
code-summary: "Show Code"
df-print: paged
page-layout: full
smooth-scroll: true
self-contained: true
execute:
warning: false
message: false
cache: true
jupyter: python3
execute-dir: project
jupyter-options:
python: "C:/Users/engrf/AppData/Local/Programs/Python/Python313/python.exe"
---
# NovaMart Global: Business Intelligence Report
**Prepared by:** Freda Erinmwingbovo
**Prepared for:** Richard Okafor, CEO, NovaMart Global
**Date:** February 2026
**Dataset:** 541,909 transactions · 38 countries · 2010–2011
## Business Objective
NovaMart Global is a UK-based online retailer operating across
38 countries. Despite holding over two years of rich transaction
data, business decisions around pricing, stock management, and
customer retention have historically been driven by intuition
rather than evidence.
This report delivers a comprehensive data intelligence analysis
across six business themes, transforming raw transaction records
into actionable revenue strategies.
### The Six Business Questions
| # | Theme | Business Question |
|---|-------|------------------|
| 1 | Customer Segmentation | Who are our most valuable customers? |
| 2 | Churn Prediction | Who are we losing and why? |
| 3 | Upsell & Cross-Sell | What should we sell together? |
| 4 | Demand Forecasting | How much will we sell and when? |
| 5 | Inventory Replenishment | Are we holding the right stock? |
| 6 | Price Optimisation | Are we pricing correctly? |
## Data Preprocessing
Before any analysis begins, the dataset must be thoroughly
inspected, cleaned, and validated. Raw transaction data from
retail systems typically contains cancellations, missing records,
data entry errors, and structural inconsistencies that must be
resolved before any reliable intelligence can be extracted.
This section documents every preprocessing decision made, ransparently and fully justified, so the analytical foundation
is beyond question.
### Library Imports
```{python}
# ── LIBRARY IMPORTS ──
# Core data manipulation
import pandas as pd
import numpy as np
# Visualisation
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns
from matplotlib.gridspec import GridSpec
# Date handling
from datetime import datetime, timedelta
# Machine learning
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (classification_report, confusion_matrix,
accuracy_score, roc_auc_score, roc_curve,
mean_absolute_error, mean_squared_error)
from sklearn.cluster import KMeans
# Association rules (cross-sell)
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder
# Time series forecasting
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.seasonal import seasonal_decompose
# Warnings
import warnings
warnings.filterwarnings('ignore')
print("All libraries imported successfully.")
print(f"pandas version: {pd.__version__}")
print(f"numpy version: {np.__version__}")
```
### Data Loading
```{python}
df = pd.read_excel('Online_Retail.xlsx', engine='openpyxl')
print("Dataset loaded successfully.")
print(f"\nShape: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"\nColumns: {list(df.columns)}")
print(f"\nData Types:\n{df.dtypes}")
print(f"\nFirst 5 rows:")
df.head()
```
### Missing Values
```{python}
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_summary = pd.DataFrame({
'Missing Count': missing,
'Missing (%)': missing_pct
}).sort_values('Missing Count', ascending=False)
print("Missing Value Summary:")
display(missing_summary[missing_summary['Missing Count'] > 0])
```
### Data Cleaning
Missing CustomerID affects 24.93% of records. Dropping these
outright would be analytically reckless — these records still
contain valid product, quantity, price, and date information.
The approved approach is to split the dataset into two subsets:
- **df_customers**: records with CustomerID, used for all
customer-level analysis
- **df_transactions**: full clean dataset regardless of CustomerID,
used for all product and transaction-level analysis
No data is discarded unnecessarily. Every record contributes
to the analysis it is capable of supporting.
```{python}
original_size = len(df)
print(f"Original dataset: {original_size:,} rows\n")
# Step 1: Remove cancelled transactions (InvoiceNo starting with C)
cancelled_mask = df['InvoiceNo'].astype(str).str.startswith('C')
df = df[~cancelled_mask]
print(f"After removing cancellations: {len(df):,} rows "
f"(removed {cancelled_mask.sum():,})")
# Step 2: Remove negative or zero quantities
df = df[df['Quantity'] > 0]
print(f"After removing invalid quantities: {len(df):,} rows")
# Step 3: Remove zero or negative unit prices
df = df[df['UnitPrice'] > 0]
print(f"After removing invalid prices: {len(df):,} rows")
# Step 4: Remove missing descriptions
df = df.dropna(subset=['Description'])
print(f"After removing missing descriptions: {len(df):,} rows")
# Step 5: Remove duplicates
dupes = df.duplicated().sum()
df = df.drop_duplicates()
print(f"After removing {dupes:,} duplicates: {len(df):,} rows")
# Step 6: Fix data types
df['InvoiceNo'] = df['InvoiceNo'].astype(str)
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df['StockCode'] = df['StockCode'].astype(str)
df['Description'] = df['Description'].str.strip().str.upper()
# Step 7: Add derived columns
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']
df['YearMonth'] = df['InvoiceDate'].dt.strftime('%Y-%m')
df['DayOfWeek'] = df['InvoiceDate'].dt.day_name()
df['Hour'] = df['InvoiceDate'].dt.hour
df['Month'] = df['InvoiceDate'].dt.to_period('M')
# ── SPLIT INTO TWO SUBSETS ──
# Customer-level analysis — requires CustomerID
df_customers = df[df['CustomerID'].notna()].copy()
df_customers['CustomerID'] = (df_customers['CustomerID']
.astype(int).astype(str))
# Transaction-level analysis — all clean records
df_transactions = df.copy()
print(f"DATASET SPLIT SUMMARY")
print(f"Original records: {original_size:>10,}")
print(f"Clean records (full): {len(df_transactions):>10,}")
print(f"Records removed: {original_size - len(df):>10,}")
print(f"")
print(f"df_customers (with ID): {len(df_customers):>10,}")
print(f"df_transactions (all): {len(df_transactions):>10,}")
print(f"")
print(f"Unique customers: "
f"{df_customers['CustomerID'].nunique():>10,}")
print(f"Unique products: "
f"{df_transactions['StockCode'].nunique():>10,}")
print(f"Unique invoices: "
f"{df_transactions['InvoiceNo'].nunique():>10,}")
print(f"Countries covered: "
f"{df_transactions['Country'].nunique():>10,}")
print(f"Date range: "
f"{df_transactions['InvoiceDate'].min().date()} to "
f"{df_transactions['InvoiceDate'].max().date()}")
print(f"Total revenue: "
f"£{df_transactions['TotalPrice'].sum():>10,.2f}")
print(f"\nSubset usage guide:")
print(f" Customer Segmentation → df_customers")
print(f" Churn Prediction → df_customers")
print(f" Cross-Sell → df_transactions")
print(f" Demand Forecasting → df_transactions")
print(f" Inventory → df_transactions")
print(f" Price Optimisation → df_transactions")
```
### Preprocessing Interpretation
The cleaning process removed **17,031 records — 3.1%** of the
original dataset. This is a conservative and defensible removal
rate, achieved by treating missing CustomerID as a data
characteristic rather than a disqualifying flaw.
Rather than discarding the 24.93% of records without CustomerID,
the dataset was split into two purpose-specific subsets:
- **df_customers (392,692 records)**: used exclusively for
customer-level analysis where identity is essential:
segmentation and churn prediction
- **df_transactions (524,878 records)**: used for all
product and transaction-level analysis: cross-sell, demand
forecasting, inventory, and price optimisation
The clean dataset spans **13 months** of trading across
**38 countries**, covering **4,338 identifiable customers**,
**3,922 unique products**, and **£10,642,110.80** in total
revenue, a rich and reliable foundation for all six
analytical themes that follow.
# Analysis
::: {.panel-tabset}
## Customer Segmentation
### Business Question
Who are NovaMart's most valuable customers and how should
the business treat them differently?
### Methodology
RFM Analysis evaluates every customer across three dimensions:
Recency, Frequency, and Monetary value. KMeans clustering
groups customers into distinct segments based on their
combined RFM profile.
**Dataset used:** df_customers (392,692 records · 4,338 customers)
### RFM Feature Enginerring
```{python}
# Reference date — one day after the last transaction
reference_date = df_customers['InvoiceDate'].max() + timedelta(days=1)
rfm = df_customers.groupby('CustomerID').agg(
Recency = ('InvoiceDate',
lambda x: (reference_date - x.max()).days),
Frequency = ('InvoiceNo', 'nunique'),
Monetary = ('TotalPrice', 'sum')
).reset_index()
print("RFM Table: First 10 Customers:")
display(rfm.head(10))
print("RFM Summary Statistics:")
display(rfm[['Recency','Frequency','Monetary']].describe().round(2))
```
### RFM Scoring and Clustering
```{python}
# Log transform Monetary and Frequency to reduce skew
rfm['Recency_log'] = np.log1p(rfm['Recency'])
rfm['Frequency_log'] = np.log1p(rfm['Frequency'])
rfm['Monetary_log'] = np.log1p(rfm['Monetary'])
# Scale features
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(
rfm[['Recency_log', 'Frequency_log', 'Monetary_log']]
)
# ── FIND OPTIMAL K — ELBOW METHOD ──
inertia = []
k_range = range(2, 11)
for k in k_range:
km = KMeans(n_clusters=k, random_state=123, n_init=10)
km.fit(rfm_scaled)
inertia.append(km.inertia_)
# Plot elbow curve
fig, ax = plt.subplots(figsize=(9, 5))
fig.patch.set_facecolor('#0d0d1a')
ax.set_facecolor('#13131f')
ax.plot(k_range, inertia, color='#00c896', linewidth=2.5,
marker='o', markersize=7, markerfacecolor='#f0a500',
markeredgecolor='#f0a500')
ax.set_xlabel('Number of Clusters (k)', color='#c8c8d8', fontsize=11)
ax.set_ylabel('Inertia', color='#c8c8d8', fontsize=11)
ax.set_title('Figure: Elbow Method: Optimal Number of Clusters',
color='#ffffff', fontsize=13, fontweight='bold', pad=15)
ax.tick_params(colors='#c8c8d8')
ax.spines[:].set_color('#2a2a3a')
ax.grid(color='#2a2a3a', linestyle='--', linewidth=0.7)
plt.tight_layout()
plt.show()
print("\nInertia values by k:")
for k, i in zip(k_range, inertia):
print(f" k={k}: {i:.2f}")
```
### Final Clustering
```{python}
km_final = KMeans(n_clusters=4, random_state=123, n_init=10)
rfm['Cluster'] = km_final.fit_predict(rfm_scaled)
# ── CLUSTER PROFILING ──
cluster_profile = rfm.groupby('Cluster').agg(
Customers = ('CustomerID', 'count'),
Avg_Recency = ('Recency', 'mean'),
Avg_Frequency = ('Frequency', 'mean'),
Avg_Monetary = ('Monetary', 'mean'),
Total_Revenue = ('Monetary', 'sum')
).round(2).reset_index()
# ── LABEL CLUSTERS BASED ON PROFILE ──
# Sort by Monetary descending to assign labels
cluster_profile = cluster_profile.sort_values(
'Avg_Monetary', ascending=False
).reset_index(drop=True)
labels = ['Champions', 'Loyal Customers', 'At Risk', 'Lost']
cluster_profile['Segment'] = labels
# Map labels back to rfm dataframe
label_map = dict(zip(cluster_profile['Cluster'],
cluster_profile['Segment']))
rfm['Segment'] = rfm['Cluster'].map(label_map)
print("Cluster Profiles:")
display(cluster_profile[['Segment', 'Customers', 'Avg_Recency',
'Avg_Frequency', 'Avg_Monetary',
'Total_Revenue']])
```
### Segment Visualization
```{python}
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.patch.set_facecolor('#0d0d1a')
fig.suptitle('Figure: Customer Segmentation — RFM Analysis',
color='#ffffff', fontsize=15, fontweight='bold', y=1.01)
colors = ['#00c896', '#f0a500', '#4d9fff', '#ff6b6b']
segments = cluster_profile['Segment'].tolist()
# ── Plot 1: Customer Count by Segment ──
ax1 = axes[0, 0]
ax1.set_facecolor('#13131f')
bars = ax1.bar(segments, cluster_profile['Customers'],
color=colors, edgecolor='none', width=0.6)
ax1.set_title('Customers per Segment',
color='#ffffff', fontweight='bold', pad=10)
ax1.set_ylabel('Number of Customers', color='#c8c8d8')
ax1.tick_params(colors='#c8c8d8')
ax1.spines[:].set_color('#2a2a3a')
ax1.set_facecolor('#13131f')
for bar, val in zip(bars, cluster_profile['Customers']):
ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 20,
f'{val:,}', ha='center', color='#ffffff', fontsize=10)
# ── Plot 2: Average Monetary by Segment ──
ax2 = axes[0, 1]
ax2.set_facecolor('#13131f')
bars2 = ax2.bar(segments, cluster_profile['Avg_Monetary'],
color=colors, edgecolor='none', width=0.6)
ax2.set_title('Average Spend per Segment (£)',
color='#ffffff', fontweight='bold', pad=10)
ax2.set_ylabel('Average Monetary Value (£)', color='#c8c8d8')
ax2.tick_params(colors='#c8c8d8')
ax2.spines[:].set_color('#2a2a3a')
ax2.set_facecolor('#13131f')
for bar, val in zip(bars2, cluster_profile['Avg_Monetary']):
ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 50,
f'£{val:,.0f}', ha='center', color='#ffffff', fontsize=9)
# ── Plot 3: Total Revenue by Segment ──
ax3 = axes[1, 0]
ax3.set_facecolor('#13131f')
revenue_pct = (cluster_profile['Total_Revenue'] /
cluster_profile['Total_Revenue'].sum() * 100).round(1)
wedges, texts, autotexts = ax3.pie(
cluster_profile['Total_Revenue'],
labels=segments,
colors=colors,
autopct='%1.1f%%',
startangle=90,
pctdistance=0.75
)
for text in texts:
text.set_color('#c8c8d8')
text.set_fontsize(10)
for autotext in autotexts:
autotext.set_color('#ffffff')
autotext.set_fontsize(9)
autotext.set_fontweight('bold')
ax3.set_title('Revenue Share by Segment',
color='#ffffff', fontweight='bold', pad=10)
ax3.set_facecolor('#13131f')
# ── Plot 4: Recency vs Frequency (bubble = monetary) ──
ax4 = axes[1, 1]
ax4.set_facecolor('#13131f')
for i, row in cluster_profile.iterrows():
seg_data = rfm[rfm['Segment'] == row['Segment']]
ax4.scatter(seg_data['Recency'], seg_data['Frequency'],
alpha=0.4, s=20, color=colors[i],
label=row['Segment'])
ax4.set_title('Recency vs Frequency by Segment',
color='#ffffff', fontweight='bold', pad=10)
ax4.set_xlabel('Recency (days)', color='#c8c8d8')
ax4.set_ylabel('Frequency (purchases)', color='#c8c8d8')
ax4.tick_params(colors='#c8c8d8')
ax4.spines[:].set_color('#2a2a3a')
ax4.legend(facecolor='#13131f', labelcolor='#c8c8d8',
fontsize=9, framealpha=0.8)
for ax in axes.flat:
ax.set_facecolor('#13131f')
plt.tight_layout()
plt.show()
```
### Segmentation Findings and Business Recommendations
RFM clustering identified four distinct customer segments
with dramatically different value profiles.
**The 80/20 finding, more extreme than expected:**
NovaMart's Champions segment, just 717 customers, 16.5% of
the customer base, generates £5,778,434 — 65% of total
revenue. This is a critical business risk.
| Segment | Customers | Avg Recency | Avg Frequency | Avg Spend | Revenue Share |
|---------|-----------|-------------|---------------|-----------|---------------|
| Champions | 717 | 12 days | 13.7 purchases | £8,059 | 65.0% |
| Loyal Customers | 1,183 | 71 days | 4.1 purchases | £1,783 | 23.7% |
| At Risk | 835 | 18 days | 2.2 purchases | £547 | 5.1% |
| Lost | 1,603 | 183 days | 1.3 purchases | £339 | 6.1% |
**Recommendations:**
- **Champions**: Protect at all costs. Launch a VIP programme
with early access, exclusive offers, and dedicated account
management.
- **Loyal Customers**: Target with frequency-driving campaigns.
Loyalty points, volume discounts, and personalised product
recommendations to move them into Champions.
- **At Risk**: Intervene immediately with a time-limited
re-engagement campaign.
- **Lost**: Selective winback only. Focus on Lost customers
who previously had higher spend history.
## Churn Prediction
### Business Question
Which customers are at risk of churning and how accurately
can this be predicted before it happens?
### Methodology
Churn is defined as a customer who has not made a purchase
in the last 90 days. Seven behavioural features derived from
transaction history were used — Recency was deliberately
excluded as a predictive feature to avoid target leakage.
**Dataset used:** df_customers (392,692 records · 4,338 customers)
### Churn Labelling
```{python}
# Define churn threshold — 90 days
churn_threshold = 90
# Label customers as churned (1) or active (0)
rfm['Churned'] = (rfm['Recency'] > churn_threshold).astype(int)
# Churn summary
total = len(rfm)
churned = rfm['Churned'].sum()
active = total - churned
churn_rate = churned / total * 100
print("Churn Definition: No purchase in last 90 days")
print(f"Total Customers: {total:>8,}")
print(f"Active Customers: {active:>8,} ({100-churn_rate:.1f}%)")
print(f"Churned Customers: {churned:>8,} ({churn_rate:.1f}%)")
```
### Churn Feature Engeering (Transaction-Based)
```{python}
# Reference date
reference_date = df_customers['InvoiceDate'].max() + timedelta(days=1)
# Build behavioural features per customer
churn_features = df_customers.groupby('CustomerID').agg(
TotalOrders = ('InvoiceNo', 'nunique'),
TotalRevenue = ('TotalPrice', 'sum'),
AvgOrderValue = ('TotalPrice', 'mean'),
TotalItems = ('Quantity', 'sum'),
UniqueProducts = ('StockCode', 'nunique'),
MonthsActive = ('YearMonth', 'nunique'),
AvgDaysBetween = ('InvoiceDate', lambda x:
x.sort_values().diff().dt.days.mean()
if len(x) > 1 else 0),
LastPurchase = ('InvoiceDate', 'max')
).reset_index()
# Calculate recency separately for churn label only
churn_features['Recency'] = (
reference_date - churn_features['LastPurchase']
).dt.days
# Define churn label
churn_features['Churned'] = (
churn_features['Recency'] > 90
).astype(int)
# Fill NaN in AvgDaysBetween (customers with 1 order)
churn_features['AvgDaysBetween'] = (
churn_features['AvgDaysBetween'].fillna(0)
)
# Drop columns not used as features
features = [
'TotalOrders', 'TotalRevenue', 'AvgOrderValue',
'TotalItems', 'UniqueProducts', 'MonthsActive',
'AvgDaysBetween'
]
X = churn_features[features]
y = churn_features['Churned']
# Summary
print(f"Churn Feature Matrix: {X.shape[0]:,} customers x "
f"{X.shape[1]} features")
print(f"\nFeatures used: {features}")
print(f"\nChurn distribution:")
print(f" Active: {(y==0).sum():,} ({(y==0).mean()*100:.1f}%)")
print(f" Churned: {(y==1).sum():,} ({(y==1).mean()*100:.1f}%)")
print(f"\nFeature Summary:")
display(X.describe().round(2))
```
### Model Training
```{python}
# ── SCALE AND SPLIT ──
scaler_churn = StandardScaler()
X_scaled = scaler_churn.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y, test_size=0.2, random_state=123, stratify=y
)
print(f"Training set: {X_train.shape[0]:,} customers")
print(f"Test set: {X_test.shape[0]:,} customers")
# ── MODEL TRAINING ──
models = {
'Logistic Regression': LogisticRegression(random_state=123,
max_iter=1000),
'Random Forest': RandomForestClassifier(random_state=123,
n_estimators=100),
'Decision Tree': DecisionTreeClassifier(random_state=123)
}
results = {}
for name, model in models.items():
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]
results[name] = {
'Accuracy': round(accuracy_score(y_test, y_pred) * 100, 2),
'ROC-AUC': round(roc_auc_score(y_test, y_proba) * 100, 2),
'y_pred': y_pred,
'y_proba': y_proba,
'model': model
}
print(f"{name} — Trained")
# ── COMPARISON TABLE ──
comparison = pd.DataFrame({
'Model': list(results.keys()),
'Accuracy': [results[m]['Accuracy'] for m in results],
'ROC-AUC': [results[m]['ROC-AUC'] for m in results]
}).sort_values('ROC-AUC', ascending=False).reset_index(drop=True)
print("Model Comparison Churn Prediction")
display(comparison)
print(f"\nBest Model: {comparison['Model'].iloc[0]}")
print(f"ROC-AUC: {comparison['ROC-AUC'].iloc[0]}%")
print(f"Accuracy: {comparison['Accuracy'].iloc[0]}%")
```
### Confussion Matrix: Logistic Regression
```{python}
# ── CONFUSION MATRIX LOGISTIC REGRESSION ──
best_name = 'Logistic Regression'
best_pred = results[best_name]['y_pred']
best_proba = results[best_name]['y_proba']
cm = confusion_matrix(y_test, best_pred)
print("Confusion Matrix — Logistic Regression")
print(f"True Negatives (Active correctly identified): {cm[0,0]}")
print(f"False Positives (Active wrongly flagged churned):{cm[0,1]}")
print(f"False Negatives (Churned missed by model): {cm[1,0]}")
print(f"True Positives (Churned correctly identified): {cm[1,1]}")
# ── ACTUAL VS PREDICTED SUMMARY ──
churn_results = pd.DataFrame({
'Actual': y_test.values,
'Predicted': best_pred
})
churn_summary = churn_results.groupby(
['Actual', 'Predicted']
).size().reset_index(name='Count')
churn_summary['Actual'] = churn_summary['Actual'].map(
{0: 'Active', 1: 'Churned'}
)
churn_summary['Predicted'] = churn_summary['Predicted'].map(
{0: 'Active', 1: 'Churned'}
)
churn_summary['Result'] = churn_summary.apply(
lambda r: 'Correct' if r['Actual'] == r['Predicted']
else 'Incorrect', axis=1
)
print("Actual vs Predicted Summary:")
display(churn_summary)
# ── PREDICTION BREAKDOWN ──
correct = (churn_results['Actual'] == churn_results['Predicted']).sum()
incorrect = len(churn_results) - correct
print(f"\nPrediction Breakdown:")
print(f" Correctly classified: {correct}")
print(f" Incorrectly classified: {incorrect}")
print(f" Overall Accuracy: {round(correct/len(churn_results)*100, 2)}%")
```
### Save Best Churn Model
```{python}
# ── SAVE BEST CHURN MODEL ──
import joblib
joblib.dump(results['Logistic Regression']['model'],
'best_model_churn.pkl')
joblib.dump(scaler_churn,
'scaler_churn.pkl')
print("Churn model saved successfully.")
print("Model: best_model_churn.pkl")
print("Scaler: scaler_churn.pkl")
print("\nNote: Scaler saved alongside model to ensure")
print("new customer data is transformed consistently")
print("before scoring.")
```
### Churn Prediction Findings and Business Recommendations
Three models were trained on seven behavioural features.
Recency was excluded to prevent target leakage.
| Model | Accuracy | ROC-AUC |
|-------|----------|---------|
| Logistic Regression | 71.20% | 76.39% |
| Random Forest | 72.12% | 76.06% |
| Decision Tree | 63.48% | 58.49% |
Of 868 test customers, 618 were correctly classified (71.2%).
The model correctly identified 159 of 290 churned customers
and 459 of 578 active customers.
**Recommendations:**
- Deploy the Logistic Regression model to score all customers
monthly and flag those with a churn probability above 0.4.
- Supplement predictive scoring with rule-based triggers such
as 60 days of inactivity or a sudden drop in order frequency.
- Prioritise retention spend on Champions and Loyal segments
identified in the first analysis.
## Cross-Sell
### Business Question
What products do customers naturally buy together, and how
can NovaMart use this to increase average basket size?
### Methodology: Association Rules (Apriori Algorithm)
The Apriori algorithm identifies frequent itemsets, groups
of products that appear together in transactions regularly.
From these itemsets, association rules are generated in the
form of "customers who buy Product A also tend to buy Product B."
Three key metrics guide rule selection:
- **Support**: how often the itemset appears across all
transactions. A support of 0.02 means the combination
appears in at least 2% of all baskets.
- **Confidence**: given that a customer bought Product A,
how likely are they to also buy Product B.
- **Lift**: how much more likely the combination is compared
to random chance. A lift above 1 indicates a genuine
association. Higher lift means stronger recommendation.
**Dataset used:** df_transactions (524,878 records)
### Croos-Sell Basket Preparation
```{python}
# Build basket matrix — one row per invoice, one column per product
# UK transactions only for cleaner, more representative rules
df_uk = df_transactions[
df_transactions['Country'] == 'United Kingdom'
].copy()
print(f"UK transactions: {len(df_uk):,} rows")
print(f"UK invoices: {df_uk['InvoiceNo'].nunique():,}")
print(f"UK products: {df_uk['StockCode'].nunique():,}")
# Create basket — True/False for each product per invoice
basket = df_uk.groupby(
['InvoiceNo', 'Description']
)['Quantity'].sum().unstack(fill_value=0)
# Convert quantities to binary (bought or not)
basket_binary = basket.applymap(lambda x: 1 if x > 0 else 0)
print(f"\nBasket matrix shape: {basket_binary.shape}")
print(f"Invoices: {basket_binary.shape[0]:,}")
print(f"Products: {basket_binary.shape[1]:,}")
```
### Apriori Algorithm
```{python}
# Generate frequent itemsets
# min_support = 0.02 means product combo appears in at least 2% of baskets
frequent_itemsets = apriori(
basket_binary,
min_support = 0.02,
use_colnames = True,
max_len = 2
)
frequent_itemsets = frequent_itemsets.sort_values(
'support', ascending=False
).reset_index(drop=True)
print(f"Frequent itemsets found: {len(frequent_itemsets):,}")
print(f"\nTop 10 most frequent itemsets:")
display(frequent_itemsets.head(10))
# ── GENERATE ASSOCIATION RULES ──
rules = association_rules(
frequent_itemsets,
metric = 'lift',
min_threshold = 1.5
)
rules = rules.sort_values('lift', ascending=False).reset_index(drop=True)
# Keep only the most useful columns
rules_clean = rules[[
'antecedents', 'consequents',
'support', 'confidence', 'lift'
]].copy()
rules_clean['antecedents'] = rules_clean['antecedents'].apply(
lambda x: ', '.join(list(x))
)
rules_clean['consequents'] = rules_clean['consequents'].apply(
lambda x: ', '.join(list(x))
)
rules_clean = rules_clean.round(4)
print(f"\nAssociation rules generated: {len(rules_clean):,}")
print(f"\nTop 15 rules by lift:")
display(rules_clean.head(15))
```
### Cross-Sell Visualization
```{python}
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
fig.patch.set_facecolor('#0d0d1a')
fig.suptitle('Figure: Cross-Sell Association Rules',
color='#ffffff', fontsize=14, fontweight='bold')
# ── Plot 1: Support vs Confidence, sized by Lift ──
ax1 = axes[0]
ax1.set_facecolor('#13131f')
scatter = ax1.scatter(
rules_clean['support'],
rules_clean['confidence'],
c = rules_clean['lift'],
s = rules_clean['lift'] * 3,
cmap = 'YlGn',
alpha = 0.7,
edgecolors = 'none'
)
cbar = plt.colorbar(scatter, ax=ax1)
cbar.set_label('Lift', color='#c8c8d8')
cbar.ax.yaxis.set_tick_params(color='#c8c8d8')
plt.setp(cbar.ax.yaxis.get_ticklabels(), color='#c8c8d8')
ax1.set_xlabel('Support', color='#c8c8d8', fontsize=10)
ax1.set_ylabel('Confidence', color='#c8c8d8', fontsize=10)
ax1.set_title('Support vs Confidence\n(size and colour = lift)',
color='#ffffff', fontweight='bold', pad=10)
ax1.tick_params(colors='#c8c8d8')
ax1.spines[:].set_color('#2a2a3a')
ax1.grid(color='#2a2a3a', linestyle='--', linewidth=0.6)
# ── Plot 2: Top 10 Rules by Lift ──
ax2 = axes[1]
ax2.set_facecolor('#13131f')
top10 = rules_clean.head(10).copy()
top10['rule'] = (top10['antecedents'].str[:25] + ' →\n' +
top10['consequents'].str[:25])
bars = ax2.barh(
range(len(top10)),
top10['lift'],
color = '#00c896',
edgecolor = 'none',
height = 0.6
)
ax2.set_yticks(range(len(top10)))
ax2.set_yticklabels(top10['rule'], color='#c8c8d8', fontsize=7)
ax2.set_xlabel('Lift', color='#c8c8d8', fontsize=10)
ax2.set_title('Top 10 Cross-Sell Rules by Lift',
color='#ffffff', fontweight='bold', pad=10)
ax2.tick_params(colors='#c8c8d8')
ax2.spines[:].set_color('#2a2a3a')
ax2.grid(color='#2a2a3a', linestyle='--', linewidth=0.6, axis='x')
ax2.invert_yaxis()
for bar, val in zip(bars, top10['lift']):
ax2.text(bar.get_width() + 0.3, bar.get_y() + bar.get_height()/2,
f'{val:.1f}', va='center', color='#ffffff', fontsize=8)
plt.tight_layout()
plt.show()
```
### Cross-Sell Findings and Business Recommendations
194 association rules identified from 397 frequent itemsets
across 18,019 UK invoices.
| If Customer Buys | Recommend | Confidence | Lift |
|-----------------|-----------|------------|------|
| Wooden Heart Christmas Scandinavian | Wooden Star Christmas Scandinavian | 72.3% | 27.2 |
| Pink Regency Teacup and Saucer | Green Regency Teacup and Saucer | 82.1% | 15.9 |
| Dolly Girl Lunch Box | Spaceboy Lunch Box | 60.9% | 15.7 |
| Woodland Charlotte Bag | Strawberry Charlotte Bag | 54.9% | 14.7 |
NovaMart customers are collectors and gift buyers. The
strongest associations reflect themed sets and colour
variants within the same range.
**Recommendations:**
- Implement a "Frequently Bought Together" section on all
product pages powered by these association rules.
- Bundle top associated pairs into curated gift sets with
a slight price incentive.
- Trigger cross-sell emails within 48 hours of purchase
to capture buying intent while it is still active.
## Demand Forecasting
### Business Question
How much will NovaMart sell each month, and when should
the business expect demand peaks and troughs?
### Methodology: Exponential Smoothing (Holt-Winters)
Demand forecasting uses time series analysis to project
future sales based on historical patterns. Holt-Winters
Exponential Smoothing was selected because it handles
three components simultaneously:
- **Level**: the baseline average demand
- **Trend**: whether demand is growing or declining over time
- **Seasonality**: recurring peaks and troughs at regular intervals
NovaMart operates in retail gifting and home decor, making
seasonality a critical factor. The model is trained on
13 months of historical monthly revenue and used to
forecast the next 3 months.
**Dataset used:** df_transactions (524,878 records)
### Demand Forecasting: Monthly Revenue Time Series
```{python}
# Aggregate total revenue by month
monthly_revenue = df_transactions.groupby('YearMonth').agg(
Revenue = ('TotalPrice', 'sum'),
Orders = ('InvoiceNo', 'nunique'),
Quantity = ('Quantity', 'sum')
).reset_index().sort_values('YearMonth')
print("Monthly Revenue Summary:")
display(monthly_revenue)
print(f"\nTotal months: {len(monthly_revenue)}")
print(f"Total revenue: £{monthly_revenue['Revenue'].sum():,.2f}")
print(f"Peak month: "
f"{monthly_revenue.loc[monthly_revenue['Revenue'].idxmax(), 'YearMonth']}")
print(f"Peak revenue: "
f"£{monthly_revenue['Revenue'].max():,.2f}")
print(f"Lowest month: "
f"{monthly_revenue.loc[monthly_revenue['Revenue'].idxmin(), 'YearMonth']}")
print(f"Lowest revenue: "
f"£{monthly_revenue['Revenue'].min():,.2f}")
# Exclude partial December 2011 and build time series
monthly_clean = monthly_revenue[
monthly_revenue['YearMonth'] != '2011-12'
].copy()
revenue_series = monthly_clean.set_index('YearMonth')['Revenue']
print(f"\nTraining series: {len(revenue_series)} months")
print(f"Period: {revenue_series.index[0]} to {revenue_series.index[-1]}")
```
### Holt-Winters Forecasting (Trend Only)
```{python}
# With only 12 months of data, full seasonal modelling requires
# a minimum of 24 months (two complete cycles).
# We apply a trend-only Exponential Smoothing model honest
# and appropriate given the data available.
model = ExponentialSmoothing(
revenue_series,
trend = 'add',
seasonal = None,
initialization_method = 'estimated'
)
fitted_model = model.fit(optimized=True)
# Forecast next 3 months
forecast = fitted_model.forecast(3)
forecast_months = ['2011-12', '2012-01', '2012-02']
forecast.index = forecast_months
print(f"Forecast Next 3 Months:")
for month, value in zip(forecast_months, forecast):
print(f" {month}: £{value:,.2f}")
# Model fit metrics
fitted_values = fitted_model.fittedvalues
mae = mean_absolute_error(revenue_series, fitted_values)
rmse = np.sqrt(mean_squared_error(revenue_series, fitted_values))
mape = (np.abs((revenue_series - fitted_values) /
revenue_series).mean() * 100)
print(f"\nModel Fit Metrics:")
print(f" MAE: £{mae:,.2f}")
print(f" RMSE: £{rmse:,.2f}")
print(f" MAPE: {mape:.2f}%")
```
### Demand Forecast Visualization
```{python}
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
fig.patch.set_facecolor('#0d0d1a')
fig.suptitle('Figure: Demand Forecasting — Monthly Revenue',
color='#ffffff', fontsize=14, fontweight='bold')
all_months = list(revenue_series.index) + forecast_months
all_actual = list(revenue_series.values)
all_fitted = list(fitted_values.values)
all_forecast = list(forecast.values)
# ── Plot 1: Actual vs Fitted + Forecast ──
ax1 = axes[0]
ax1.set_facecolor('#13131f')
ax1.plot(range(len(revenue_series)),
all_actual,
color='#00c896', linewidth=2.5,
marker='o', markersize=5,
label='Actual Revenue')
ax1.plot(range(len(revenue_series)),
all_fitted,
color='#f0a500', linewidth=2,
linestyle='--', label='Fitted Values')
forecast_x = range(len(revenue_series),
len(revenue_series) + 3)
ax1.plot(list(forecast_x),
all_forecast,
color='#4d9fff', linewidth=2.5,
marker='o', markersize=6,
linestyle='--', label='Forecast')
ax1.axvline(x=len(revenue_series) - 0.5,
color='#ffffff', linewidth=0.8,
linestyle=':', alpha=0.5)
ax1.text(len(revenue_series) - 0.4,
max(all_actual) * 0.95,
'Forecast', color='#4d9fff',
fontsize=8, fontfamily='monospace')
ax1.set_xticks(range(len(all_months)))
ax1.set_xticklabels(all_months, rotation=45,
ha='right', color='#c8c8d8', fontsize=7)
ax1.set_ylabel('Revenue (£)', color='#c8c8d8')
ax1.set_title('Actual vs Fitted vs Forecast',
color='#ffffff', fontweight='bold', pad=10)
ax1.tick_params(colors='#c8c8d8')
ax1.spines[:].set_color('#2a2a3a')
ax1.grid(color='#2a2a3a', linestyle='--', linewidth=0.6)
ax1.legend(facecolor='#13131f', labelcolor='#c8c8d8', fontsize=9)
ax1.yaxis.set_major_formatter(
mticker.FuncFormatter(lambda x, _: f'£{x/1000:.0f}K')
)
# ── Plot 2: Monthly Revenue Bar Chart ──
ax2 = axes[1]
ax2.set_facecolor('#13131f')
bar_colors = ['#00c896'] * len(revenue_series) + ['#4d9fff'] * 3
all_values = all_actual + all_forecast
bars = ax2.bar(range(len(all_months)),
all_values,
color=bar_colors,
edgecolor='none', width=0.7)
ax2.set_xticks(range(len(all_months)))
ax2.set_xticklabels(all_months, rotation=45,
ha='right', color='#c8c8d8', fontsize=7)
ax2.set_ylabel('Revenue (£)', color='#c8c8d8')
ax2.set_title('Monthly Revenue with Forecast',
color='#ffffff', fontweight='bold', pad=10)
ax2.tick_params(colors='#c8c8d8')
ax2.spines[:].set_color('#2a2a3a')
ax2.grid(color='#2a2a3a', linestyle='--',
linewidth=0.6, axis='y')
ax2.yaxis.set_major_formatter(
mticker.FuncFormatter(lambda x, _: f'£{x/1000:.0f}K')
)
from matplotlib.patches import Patch
legend_elements = [
Patch(facecolor='#00c896', label='Historical'),
Patch(facecolor='#4d9fff', label='Forecast')
]
ax2.legend(handles=legend_elements,
facecolor='#13131f',
labelcolor='#c8c8d8', fontsize=9)
plt.tight_layout()
plt.show()
```
### Demand Forecasting Findings and Business Recommendations
| Month | Revenue |
|-------|---------|
| 2011-02 | £522,546 (lowest) |
| 2011-09 | £1,056,435 (surge begins) |
| 2011-11 | £1,503,867 (peak) |
**3-Month Forecast:**
| Month | Forecast |
|-------|---------|
| 2011-12 | £1,649,797 |
| 2012-01 | £1,886,622 |
| 2012-02 | £2,123,447 |
**Honest Assessment:** The December forecast is plausible.
January and February projections are likely overstated as
the trend-only model cannot detect the post-Christmas
demand reversal.
**Recommendations:**
- Use December forecast as a planning target, adjusting
upward by 10% given the strong November close.
- Apply a manual seasonal correction for January and February
based on the 2011 historical pattern.
- Collect a second full year of data before investing in a
production forecasting system.
## Inventory
### Business Question
Which products need urgent restocking, which are being
over-ordered, and how should NovaMart prioritise its
inventory investment?
### Methodology
Three analytical lenses: Velocity Analysis, Revenue
Concentration, and Demand Consistency. Products are
classified into four actionable inventory priority tiers.
**Dataset used:** df_transactions (524,878 records)
### Inventory Analysis
```{python}
# Product-level aggregation
inventory = df_transactions.groupby(
['StockCode', 'Description']
).agg(
TotalQuantity = ('Quantity', 'sum'),
TotalRevenue = ('TotalPrice', 'sum'),
TotalOrders = ('InvoiceNo', 'nunique'),
AvgOrderQty = ('Quantity', 'mean'),
MonthsActive = ('YearMonth', 'nunique'),
AvgMonthlyQty = ('Quantity', lambda x:
x.sum() / df_transactions['YearMonth'].nunique())
).reset_index()
# Revenue share
inventory['RevenueShare'] = (
inventory['TotalRevenue'] /
inventory['TotalRevenue'].sum() * 100
).round(4)
# Demand consistency — months active out of 13 total
inventory['Consistency'] = (
inventory['MonthsActive'] /
df_transactions['YearMonth'].nunique() * 100
).round(2)
# Sort by total revenue
inventory = inventory.sort_values(
'TotalRevenue', ascending=False
).reset_index(drop=True)
# ── INVENTORY PRIORITY TIERS ──
def assign_tier(row):
if row['RevenueShare'] >= 0.5 and row['Consistency'] >= 70:
return 'Critical, Restock Immediately'
elif row['RevenueShare'] >= 0.2 and row['Consistency'] >= 50:
return 'High Priority, Monitor Closely'
elif row['Consistency'] < 30 and row['TotalOrders'] < 10:
return 'Review, Possible Dead Stock'
else:
return 'Standard, Routine Replenishment'
inventory['Tier'] = inventory.apply(assign_tier, axis=1)
# Summary
print("Inventory Tier Summary:")
display(inventory['Tier'].value_counts().to_frame())
print("Top 15 Products by Revenue:")
display(inventory[[
'Description', 'TotalQuantity', 'TotalRevenue',
'TotalOrders', 'Consistency', 'Tier'
]].head(15))
```
### Clean Inventory, Remove Non-Product Entries
```{python}
non_products = ['DOTCOM POSTAGE', 'POSTAGE', 'MANUAL',
'AMAZON FEE', 'BANK CHARGES', 'CRUK COMMISSION']
inventory_clean = inventory[
~inventory['Description'].str.upper().isin(non_products)
].reset_index(drop=True)
print("Inventory Tier Summary (clean):")
display(inventory_clean['Tier'].value_counts().to_frame())
print("Top 15 Products by Revenue (clean):")
display(inventory_clean[[
'Description', 'TotalQuantity', 'TotalRevenue',
'TotalOrders', 'Consistency', 'Tier'
]].head(15))
```
### Inventory Visualization
```{python}
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
fig.patch.set_facecolor('#0d0d1a')
fig.suptitle('Figure: Inventory Replenishment Analysis',
color='#ffffff', fontsize=14, fontweight='bold')
# ── Plot 1: Tier Distribution ──
ax1 = axes[0]
ax1.set_facecolor('#13131f')
tier_counts = inventory_clean['Tier'].value_counts()
tier_labels = [t.split(',')[0] for t in tier_counts.index]
tier_colors = ['#ff6b6b', '#f0a500', '#00c896', '#4d9fff']
wedges, texts, autotexts = ax1.pie(
tier_counts,
labels = tier_labels,
colors = tier_colors,
autopct = '%1.1f%%',
startangle = 90,
pctdistance = 0.75
)
for text in texts:
text.set_color('#c8c8d8')
text.set_fontsize(9)
for autotext in autotexts:
autotext.set_color('#ffffff')
autotext.set_fontweight('bold')
autotext.set_fontsize(9)
ax1.set_title('Product Distribution by Inventory Tier',
color='#ffffff', fontweight='bold', pad=10)
# ── Plot 2: Top 10 Critical and High Priority Products ──
ax2 = axes[1]
ax2.set_facecolor('#13131f')
top_products = inventory_clean[
inventory_clean['Tier'].isin([
'Critical, Restock Immediately',
'High Priority, Monitor Closely'
])
].head(10).copy()
top_products['ShortName'] = (
top_products['Description'].str[:30]
)
bar_colors = [
'#ff6b6b' if 'Critical' in t else '#f0a500'
for t in top_products['Tier']
]
bars = ax2.barh(
range(len(top_products)),
top_products['TotalRevenue'],
color = bar_colors,
edgecolor = 'none',
height = 0.6
)
ax2.set_yticks(range(len(top_products)))
ax2.set_yticklabels(top_products['ShortName'],
color='#c8c8d8', fontsize=8)
ax2.set_xlabel('Total Revenue (£)', color='#c8c8d8')
ax2.set_title('Top Priority Products by Revenue',
color='#ffffff', fontweight='bold', pad=10)
ax2.tick_params(colors='#c8c8d8')
ax2.spines[:].set_color('#2a2a3a')
ax2.grid(color='#2a2a3a', linestyle='--',
linewidth=0.6, axis='x')
ax2.invert_yaxis()
ax2.xaxis.set_major_formatter(
mticker.FuncFormatter(lambda x, _: f'£{x/1000:.0f}K')
)
from matplotlib.patches import Patch
legend_elements = [
Patch(facecolor='#ff6b6b', label='Critical'),
Patch(facecolor='#f0a500', label='High Priority')
]
ax2.legend(handles=legend_elements,
facecolor='#13131f',
labelcolor='#c8c8d8', fontsize=9)
plt.tight_layout()
plt.show()
```
### Inventory Replenishment Findings and Business Recommendations
| Tier | Products |
|------|----------|
| Critical — Restock Immediately | 7 |
| High Priority — Monitor Closely | 70 |
| Standard — Routine Replenishment | 3,423 |
| Review — Possible Dead Stock | 652 |
The 7 Critical products sell every single month without
exception. A stockout on any of these during Q4 would
cause direct, measurable revenue loss.
**Recommendations:**
- Establish minimum stock thresholds for all 7 Critical
products with automatic reorder triggers.
- Increase Critical product stock levels by at least 40%
from October onwards.
- Review the 652 dead stock candidates quarterly and
consider clearance pricing for products with no orders
in the last 6 months.
## Price Optimisation
### Business Question
Are NovaMart's products priced correctly and where are
opportunities to adjust pricing to maximise revenue?
### Methodology
Three analytical lenses: Price Distribution Analysis,
Price vs Demand Relationship, and Revenue per Order Analysis.
**Dataset used:** df_transactions (524,878 records)
### Price OPtimization Analysis
```{python}
# Product-level price and demand summary
price_analysis = df_transactions.groupby(
['StockCode', 'Description']
).agg(
AvgPrice = ('UnitPrice', 'mean'),
MinPrice = ('UnitPrice', 'min'),
MaxPrice = ('UnitPrice', 'max'),
TotalQuantity = ('Quantity', 'sum'),
TotalOrders = ('InvoiceNo', 'nunique'),
TotalRevenue = ('TotalPrice', 'sum'),
AvgOrderQty = ('Quantity', 'mean')
).reset_index()
# Remove non-products
price_analysis = price_analysis[
~price_analysis['Description'].str.upper().isin(
non_products
)
].reset_index(drop=True)
# Revenue per order
price_analysis['RevenuePerOrder'] = (
price_analysis['TotalRevenue'] /
price_analysis['TotalOrders']
).round(2)
# Price variability — products sold at different prices
price_analysis['PriceVariability'] = (
price_analysis['MaxPrice'] -
price_analysis['MinPrice']
).round(2)
# Price bands
def price_band(price):
if price < 1:
return 'Under £1'
elif price < 5:
return '£1 to £5'
elif price < 10:
return '£5 to £10'
elif price < 20:
return '£10 to £20'
elif price < 50:
return '£20 to £50'
else:
return '£50 and above'
price_analysis['PriceBand'] = price_analysis['AvgPrice'].apply(
price_band
)
# Summary by price band
band_summary = price_analysis.groupby('PriceBand').agg(
Products = ('StockCode', 'count'),
TotalRevenue = ('TotalRevenue', 'sum'),
AvgOrders = ('TotalOrders', 'mean'),
AvgQtySold = ('TotalQuantity', 'mean')
).round(2).reset_index()
band_order = ['Under £1', '£1 to £5', '£5 to £10',
'£10 to £20', '£20 to £50', '£50 and above']
band_summary['PriceBand'] = pd.Categorical(
band_summary['PriceBand'],
categories=band_order, ordered=True
)
band_summary = band_summary.sort_values('PriceBand')
print("Revenue by Price Band:")
display(band_summary)
print("Top 15 Products by Revenue per Order:")
display(price_analysis.nlargest(15, 'RevenuePerOrder')[[
'Description', 'AvgPrice', 'TotalOrders',
'TotalRevenue', 'RevenuePerOrder'
]])
print("Highest Price Variability (possible pricing inconsistency):")
display(price_analysis[
price_analysis['TotalOrders'] > 50
].nlargest(10, 'PriceVariability')[[
'Description', 'MinPrice', 'AvgPrice',
'MaxPrice', 'PriceVariability', 'TotalOrders'
]])
```
### Price Optimization Visualization
```{python}
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
fig.patch.set_facecolor('#0d0d1a')
fig.suptitle('Figure: Price Optimisation Analysis',
color='#ffffff', fontsize=14, fontweight='bold')
# ── Plot 1: Revenue by Price Band ──
ax1 = axes[0]
ax1.set_facecolor('#13131f')
bands = band_summary['PriceBand'].tolist()
revenue = band_summary['TotalRevenue'].tolist()
band_colors = ['#4d9fff', '#00c896', '#f0a500',
'#ff6b6b', '#7b61ff', '#ff9d00']
bars = ax1.bar(range(len(bands)), revenue,
color=band_colors, edgecolor='none', width=0.6)
ax1.set_xticks(range(len(bands)))
ax1.set_xticklabels(bands, rotation=30, ha='right',
color='#c8c8d8', fontsize=8)
ax1.set_ylabel('Total Revenue (£)', color='#c8c8d8')
ax1.set_title('Total Revenue by Price Band',
color='#ffffff', fontweight='bold', pad=10)
ax1.tick_params(colors='#c8c8d8')
ax1.spines[:].set_color('#2a2a3a')
ax1.grid(color='#2a2a3a', linestyle='--',
linewidth=0.6, axis='y')
ax1.yaxis.set_major_formatter(
mticker.FuncFormatter(lambda x, _: f'£{x/1000:.0f}K')
)
for bar, val in zip(bars, revenue):
ax1.text(bar.get_x() + bar.get_width()/2,
bar.get_height() + 20000,
f'£{val/1000:.0f}K',
ha='center', color='#ffffff', fontsize=8)
# ── Plot 2: Price Variability — Top Inconsistent Products ──
ax2 = axes[1]
ax2.set_facecolor('#13131f')
price_var = price_analysis[
price_analysis['TotalOrders'] > 50
].nlargest(10, 'PriceVariability').copy()
price_var['ShortName'] = price_var['Description'].str[:28]
y = range(len(price_var))
ax2.barh(y, price_var['MaxPrice'],
color='#ff6b6b', alpha=0.5,
edgecolor='none', height=0.5,
label='Max Price')
ax2.barh(y, price_var['AvgPrice'],
color='#00c896', alpha=0.9,
edgecolor='none', height=0.5,
label='Avg Price')
ax2.barh(y, price_var['MinPrice'],
color='#4d9fff', alpha=0.9,
edgecolor='none', height=0.5,
label='Min Price')
ax2.set_yticks(y)
ax2.set_yticklabels(price_var['ShortName'],
color='#c8c8d8', fontsize=8)
ax2.set_xlabel('Price (£)', color='#c8c8d8')
ax2.set_title('Price Variability — Top 10 Inconsistent Products',
color='#ffffff', fontweight='bold', pad=10)
ax2.tick_params(colors='#c8c8d8')
ax2.spines[:].set_color('#2a2a3a')
ax2.grid(color='#2a2a3a', linestyle='--',
linewidth=0.6, axis='x')
ax2.invert_yaxis()
ax2.legend(facecolor='#13131f',
labelcolor='#c8c8d8', fontsize=9)
plt.tight_layout()
plt.show()
```
### Price Optimisation Findings and Business Recommendations
The £1 to £5 band generates £6.52M — 61% of total revenue.
The Regency Cakestand has been sold at prices ranging from
£4.00 to £32.04 across 1,988 orders.
**Recommendations:**
- Standardise pricing on all Critical inventory products
immediately.
- A selective 10% price increase on the top 50 products
in the £1 to £5 band could generate approximately
£650,000 in additional annual revenue.
- Introduce a formal pricing governance process with
documented base prices and minimum wholesale prices.
:::
---
## Conclusions and Recommendations
This report delivered a comprehensive business intelligence
analysis across six themes, transforming 524,878 transaction
records into actionable revenue intelligence for NovaMart Global.
### Summary of Findings
| Theme | Key Finding | Business Impact |
|-------|-------------|-----------------|
| Customer Segmentation | 717 Champions generate 65% of revenue | Critical concentration risk |
| Churn Prediction | 33.4% of customers have gone silent | 1,449 customers need intervention |
| Cross-Sell | 194 association rules identified | Basket size increase opportunity |
| Demand Forecasting | November peak at £1.5M | Stock planning must reflect seasonality |
| Inventory Replenishment | 7 critical products with 100% consistency | Zero stockout tolerance required |
| Price Optimisation | £1 to £5 band drives 61% of revenue | Pricing inconsistencies eroding margin |
### The Three Most Urgent Actions
**1. Protect the Champions**
717 customers generate £5.78M — 65% of total revenue.
A VIP retention programme for this segment is not optional,
it is a business continuity measure.
**2. Fix the Pricing Inconsistencies**
The Regency Cakestand has been sold at prices ranging from
£4.00 to £32.04 across 1,988 orders. A formal pricing
governance framework must be implemented immediately.
**3. Deploy Cross-Sell Recommendations**
194 association rules are ready for deployment. This is
the lowest effort, highest return recommendation in this report.
### Data Limitations and Honest Constraints
- Churn model performance (ROC-AUC 76.39%) reflects the
honest limits of transaction data alone.
- Demand forecasting was limited to a trend-only model
due to only 13 months of available data.
- Missing CustomerID on 24.93% of records limits
customer-level analysis coverage.
### Final Note to Richard Okafor
The data is clear. NovaMart is not a business with a revenue
problem, it is a business with a revenue concentration
problem. The tools to address it exist within the transaction
history already collected. The recommendations in this report
require no new data infrastructure, no new technology
investment, and no external consultants. They require
decisions and execution.
The analysis is complete. The next step is yours.
---
**Report prepared by:** Freda Erinmwingbovo
**Prepared for:** Richard Okafor, CEO, NovaMart Global
**Date:** February 2026
**Tools:** Python · pandas · scikit-learn · mlxtend · statsmodels · matplotlib · Quarto
**Dataset:** UCI Machine Learning Repository: Online Retail Dataset
541,909 original records · 524,878 clean records · 38 countries · December 2010 to December 2011
*This report was produced to professional data science
standards. All findings are reproducible from the code
cells above. No results have been overstated or adjusted
to appear more favourable than the data supports.*